What We'll Cover
In Sub-Lesson 1, we explored the landscape of AI tools for data analysis and the broad promise they hold for researchers. This session gets practical. We are going to walk through what actually happens when you hand your data to an AI system and ask it to help you analyse it — the good, the bad, and the dangerously subtle.
The central message of this lesson can be stated simply: AI code that runs without errors is not the same as AI code that produces correct results. This is the "silent error problem," and it is the single most important risk in AI-assisted data analysis. A misspelled variable name throws an error. A wrong variable used in a calculation does not — it just gives you the wrong answer, formatted beautifully.
We will cover data cleaning, exploratory analysis, common silent errors, the irreplaceable role of domain expertise, and the statistical pitfalls that AI can introduce into your research. No programming background is required — the concepts apply whether you work in R, Python, SPSS, or any other tool.
🧹 Data Cleaning with AI
If data analysis were a film, data cleaning would be the months of set construction that nobody sees. Researchers routinely estimate that 80% of the time spent on any data analysis project goes into cleaning and preparation — fixing inconsistencies, handling missing values, reshaping formats, and resolving duplicates. AI can accelerate this work substantially, but only if you understand what it is doing and what it might miss.
What AI Does Well
AI excels at the repetitive, pattern-based aspects of data cleaning. It can scan thousands of rows and flag entries that deviate from expected formats in seconds. Common strengths include:
- Detecting missing values: Identifying blank cells, NA entries, placeholder values like -999 or "N/A" that should be treated as missing
- Spotting formatting inconsistencies: Dates stored as "12/03/2024" in some rows and "2024-03-12" in others, or mixed use of commas and periods as decimal separators
- Standardising entries: Recognising that "Cape Town", "CAPE TOWN", "cape town", and "CT" all refer to the same city
- Identifying duplicates: Finding rows that are exact copies or near-duplicates with minor variations
- Statistical outlier detection: Flagging values that fall far outside the expected range based on the distribution of the data
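The first three of these strengths can be sketched in a few lines of pandas. The dataset, column names, and placeholder codes below are invented for illustration; the point is that this kind of pattern-based cleanup is mechanical and easy to verify.

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: -999 and "N/A" are placeholder codes for missing
# values, and the city column mixes capitalisation with an abbreviation.
df = pd.DataFrame({
    "age": [34, -999, 51, 28],
    "city": ["Cape Town", "CAPE TOWN", "cape town", "CT"],
    "income": ["52000", "N/A", "61000", "48000"],
})

# Replace known placeholder codes with proper missing values
df = df.replace({-999: np.nan, "N/A": np.nan})

# Standardise city names: normalise case, then map the known abbreviation
df["city"] = df["city"].str.title().replace({"Ct": "Cape Town"})

print(df["city"].unique())   # all four rows now read "Cape Town"
print(df.isna().sum())       # one missing value each in age and income
```

Note that the mapping from "CT" to "Cape Town" is domain knowledge you supplied; the AI can apply such rules quickly, but you must confirm they are the right rules.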
What AI Can Miss
AI operates on statistical patterns, not domain understanding. This creates a fundamental blind spot: it cannot distinguish between a value that is unusual and a value that is wrong. However, with the right prompting, you can get it to find these issues. Consider these examples:
- Domain-specific errors: A blood pressure reading of 300/200 is not an outlier — it is an impossibility. A resting heart rate of 180 in an adult is almost certainly a recording error. AI sees numbers; you see clinical meaning. If you tell it what to look out for, it will likely spot such errors.
- Contextual plausibility: A salary of R2,000,000 per month might be a valid entry for a CEO or a data entry error for a junior employee. The same number means different things in different rows. Again, telling it to audit the data is likely to turn up this inconsistency.
- Coded values with meaning: In many datasets, specific values carry special meaning. A "0" might mean "not measured" rather than "zero." A "-1" might be a sentinel value for "not applicable." AI will treat these as ordinary numbers unless you tell it otherwise.
- Relationships between fields: A participant listed as 15 years old with a PhD should raise questions. AI may clean each column independently without checking cross-field consistency, unless you are careful.
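Cross-field consistency checks like the last example are cheap to encode once you articulate the rule. The sketch below uses invented participant records and an invented threshold (a PhD holder under 22) purely to illustrate the pattern: the domain rule comes from you, not from the AI.

```python
import pandas as pd

# Hypothetical participant records; the rule below is illustrative.
participants = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "age": [15, 42, 29],
    "education": ["PhD", "PhD", "Bachelors"],
})

# Domain rule: a PhD holder under 22 is almost certainly a data entry error.
# AI will not check this unless asked; a one-line rule catches it.
suspect = participants[
    (participants["education"] == "PhD") & (participants["age"] < 22)
]
print(suspect)  # flags participant 1 for manual review
```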
⚠️ The Danger of Silent Fixes
Perhaps the most significant risk in AI-assisted data cleaning is that the AI "fixes" your data without telling you what it changed. You ask it to clean a dataset, it returns a clean dataset, and you proceed to analysis — never knowing that it dropped 47 rows it deemed problematic, imputed values for 200 missing entries using a method you would not have chosen, or silently converted a categorical variable to a numeric one.
Best practice: Always ask AI to produce a transformation log — a complete record of every change it made, why it made it, and how many records were affected. Review this log before accepting any cleaned dataset. If the AI cannot tell you exactly what it changed, do not use the cleaned data.
📝 Practical Tip: The Before-and-After Check
Before AI touches your data, record basic summary statistics: row count, column count, mean and range for key numeric variables, frequency counts for key categorical variables. After cleaning, check these again. If the row count dropped by 15% and the mean of your primary outcome variable shifted, you need to understand exactly why before proceeding. These are sanity checks that take seconds and can save you from building an entire analysis on corrupted foundations.
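The before-and-after check takes only a few lines. The snippet below is a minimal sketch with invented numbers: one suspect entry is removed by a stand-in "cleaning step", and comparing the snapshots immediately reveals both the dropped row and the shifted mean.

```python
import pandas as pd

def snapshot(df, numeric_col):
    """Record the basic summary statistics worth comparing before and after cleaning."""
    return {
        "rows": len(df),
        "cols": df.shape[1],
        "mean": df[numeric_col].mean(),
        "min": df[numeric_col].min(),
        "max": df[numeric_col].max(),
    }

raw = pd.DataFrame({"score": [55, 60, 62, 58, 300]})   # 300 is a suspect entry
before = snapshot(raw, "score")

cleaned = raw[raw["score"] <= 100]                     # stand-in for an AI cleaning step
after = snapshot(cleaned, "score")

# One row dropped, and the mean shifted from 107.0 to 58.75: investigate why
# before proceeding to analysis.
print(before["rows"] - after["rows"])
print(before["mean"], after["mean"])
```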
🔍 Exploratory Data Analysis (EDA) with AI
Exploratory data analysis is the phase where you get to know your data — its shape, its quirks, its stories. AI tools promise to accelerate this phase by generating summary statistics and visualisations from natural language descriptions. The promise is real, but it comes with important caveats.
The Promise
Imagine being able to type: "Show me the distribution of patient recovery times, broken down by treatment group, with outliers highlighted and a statistical comparison between groups." Modern AI tools can take a request like this and generate the appropriate code, produce the visualisation, calculate summary statistics, and even offer preliminary interpretations — all in seconds.
For researchers without deep programming expertise, this is transformative. Tasks that once required hours of coding and debugging can now be accomplished through conversation. AI can generate dozens of views of your data in the time it would take you to produce one, enabling a breadth of exploration that was previously impractical.
The Reality
Speed is not the same as insight. Several problems commonly arise when AI drives the EDA process:
- Inappropriate visualisations: AI may produce a bar chart when a box plot would be more informative, or use a pie chart (almost never the right choice) for data that demands a different representation. It optimises for what looks clean, not for what reveals the most about your data
- Missing important relationships: AI generates the plots you ask for, but it cannot anticipate which relationships in your data are scientifically interesting. It will not spontaneously check whether your treatment effect varies by site, or whether your time series has a seasonal pattern — unless you think to ask
- Over-interpreting noise: AI may describe random variation as a "trend" or a "pattern." When you ask it to summarise what it sees, it will find something to say even when the honest answer is "there is nothing notable here"
- Default choices that obscure: AI makes dozens of small decisions in every plot — axis scales, bin widths, colour schemes, whether to show individual data points or just summaries. Each of these choices can hide or reveal important features of your data, and AI defaults are not always the best choices for your specific analysis
🔬 Walkthrough: An EDA Workflow with AI Assistance
Suppose you have a dataset of 500 survey responses measuring attitudes toward AI in education across three South African universities, with responses on a 5-point Likert scale across 20 items, along with demographic variables (age, gender, faculty, year of study).
Step 1 — Let AI generate the overview: Ask for basic summary statistics, missing data patterns, and distributions of all variables. This is where AI shines — it generates the mechanical overview quickly and accurately.
Step 2 — You decide what to explore: The AI's overview shows that one university has a much higher non-response rate on certain items. This is interesting — but only because you know that university recently had a controversy about AI-assisted plagiarism. AI would not make this connection.
Step 3 — Direct targeted analysis: Ask AI to break down responses by university, controlling for faculty. Ask it to check whether the non-response pattern correlates with specific survey items. You are now driving the analysis based on domain knowledge that the AI does not have.
Step 4 — Challenge the output: AI produces a plot showing a "significant difference" between universities. Before you accept this, check: Did it account for the different sample sizes? Did it use the right statistical test for Likert-scale data? Did it adjust for multiple comparisons? These are questions you must ask — AI will not volunteer that its default choices might be inappropriate.
The pattern: AI generates, you direct, you verify. At no point does the AI decide what matters — that is your job.
💡 The Golden Rule of AI-Assisted EDA
AI is excellent at generating many views of your data quickly. But you decide which views matter. The tool can show you everything; it cannot tell you what is important. Your research question, your domain knowledge, and your scientific judgment determine which of the AI's many outputs deserve further investigation and which are noise. Exploration without direction is just generating pictures.
🚨 The Silent Error Problem
This is the most important section of this lesson.
Modern agentic tools like Claude Code are excellent at handling crashes — they run the code, see the error, and iterate autonomously until it runs. That class of problem is largely solved. But there is a deeper and more dangerous category of failure: code that runs perfectly — no errors, no warnings, clean output — and produces results that are completely wrong. This is the silent error problem. Unlike a crash, there is nothing to alert you that something has gone wrong. The analysis proceeds, the paper gets written, and the error only surfaces later — if it surfaces at all.
📋 Case Study: The Analysis That Looked Perfect
A researcher asks an AI to analyse the relationship between a new teaching intervention and student exam scores. The AI produces clean, well-commented code. It generates a beautiful regression table showing a statistically significant positive effect (p = 0.003). The researcher is delighted.
But there is a problem. The dataset contains two similarly named columns: final_score (the actual exam result) and final_score_predicted (a model prediction from a previous analysis). The AI used the predicted scores instead of the actual scores. The regression is essentially correlating the intervention with a previous model's output — not with real student performance.
The code ran without errors. The output looked professional. The p-value was "significant." And every conclusion drawn from the analysis was wrong.
This is not a hypothetical scenario. It is the kind of mistake that happens routinely when AI selects variables based on column names rather than semantic understanding of what each column represents.
Why does this happen? AI generates code that is statistically plausible — it follows valid programming syntax and applies real statistical methods. But it does not understand the meaning of your data. It does not know that column B and column C represent fundamentally different things if their names are similar. It optimises for code that runs, not code that answers your specific research question correctly.
Wrong Variable Selection
AI picks a variable based on its name rather than its meaning. If your dataset has income, income_adjusted, and income_log, the AI may choose whichever seems most generic — not the one that is appropriate for your analysis. This is especially dangerous when variable names are abbreviated or ambiguous, as they often are in real research datasets.
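A cheap defence is to list every candidate column and assert, in the analysis script itself, that you are using the one you intend. The sketch below uses invented data with the column names from the case study above (final_score and final_score_predicted):

```python
import pandas as pd

# Invented data mirroring the case study: the actual exam result
# and a prior model's prediction sit side by side.
df = pd.DataFrame({
    "intervention": [0, 0, 1, 1],
    "final_score": [60, 62, 75, 78],
    "final_score_predicted": [64, 64, 66, 66],
})

# Defensive habit: surface every similarly named column before trusting
# the AI's choice, then assert the human-verified one is in use.
candidates = [c for c in df.columns if "final_score" in c]
print(candidates)  # ['final_score', 'final_score_predicted']

outcome = "final_score"  # explicit, human-verified choice
assert outcome in df.columns and "predicted" not in outcome
print(df.groupby("intervention")[outcome].mean())
```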
Off-by-One Errors in Time Series
When aligning time series data, AI may shift your data by one time period without you noticing. If you are measuring the effect of a policy implemented in March, and the AI aligns outcomes starting from February or April, your entire causal story changes. These alignment errors produce results that look plausible but point to the wrong time period.
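A one-month misalignment is easy to demonstrate with invented numbers. In the sketch below the outcome jumps in March, when the hypothetical policy takes effect; shifting the start date by a single month changes the estimated effect while producing output that looks just as plausible.

```python
import pandas as pd

# Hypothetical monthly outcomes; the policy takes effect in March 2024,
# and the series jumps at that point.
months = pd.period_range("2024-01", "2024-06", freq="M")
outcome = pd.Series([10, 10, 20, 22, 21, 23], index=months)

def policy_effect(series, start):
    """Difference in mean outcome after vs before the policy start date."""
    pre = series[series.index < start]
    post = series[series.index >= start]
    return post.mean() - pre.mean()

correct = policy_effect(outcome, pd.Period("2024-03", freq="M"))
off_by_one = policy_effect(outcome, pd.Period("2024-04", freq="M"))
print(correct, off_by_one)  # 11.5 vs roughly 8.7: one month's shift changes the estimate
```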
Incorrect Missing Data Handling
There are three fundamentally different approaches to missing data: dropping incomplete cases, imputing values, and using methods that account for missingness. Each produces different results, and the right choice depends on why data are missing — something AI cannot determine. If your data are missing because sicker patients dropped out of a study, dropping those cases biases your results. AI does not know this unless you tell it.
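The dropout scenario can be made concrete with a few invented rows. In the sketch below, outcomes are missing precisely for the two sickest patients; both dropping cases and mean imputation produce the same optimistic mean, and imputation additionally shrinks the variance. Neither approach addresses why the data are missing.

```python
import pandas as pd
import numpy as np

# Hypothetical trial where sicker patients (low baseline health) dropped out,
# so their outcomes are missing for a non-random reason.
df = pd.DataFrame({
    "baseline_health": [80, 75, 70, 40, 35],
    "outcome":         [82, 78, 72, np.nan, np.nan],
})

# Approach 1: drop incomplete cases (keeps only the healthier patients)
dropped = df["outcome"].dropna()

# Approach 2: mean imputation (fills the gaps with the observed mean)
imputed = df["outcome"].fillna(df["outcome"].mean())

# The means agree, and both are biased upward; imputation has also
# artificially reduced the spread of the data.
print(dropped.mean(), imputed.mean())
print(dropped.std(), imputed.std())
```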
Wrong Statistical Test
AI selects a parametric test when your data are non-normal, or uses a test that assumes independence when your observations are paired. It may run a t-test on Likert scale data, apply a Pearson correlation to ordinal variables, or use linear regression when the relationship is clearly non-linear. The output will still include a p-value and a confidence interval — they will just be meaningless.
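The Likert-scale case is easy to demonstrate: both tests below run without complaint on simulated ordinal responses, but only the rank-based test matches the data type. The data are randomly generated for illustration; this sketch assumes scipy is available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated Likert responses (1-5) for two groups: ordinal, not normal
group_a = rng.integers(1, 6, size=40)
group_b = rng.integers(1, 6, size=40)

# A t-test assumes interval-scaled, roughly normal data; it will still
# happily return a p-value on ordinal responses.
t_p = stats.ttest_ind(group_a, group_b).pvalue

# Mann-Whitney U is a rank-based alternative that respects the ordinal scale.
u_p = stats.mannwhitneyu(group_a, group_b).pvalue

print(t_p, u_p)  # both run cleanly; only one matches the data structure
```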
Grouping and Aggregation Errors
When AI groups or aggregates your data, it may inadvertently change the unit of analysis. If you have repeated measurements per participant and the AI calculates a mean across all observations rather than first averaging within participants, you have pseudo-replication — your effective sample size is inflated, your standard errors are too small, and your p-values are too optimistic. The analysis looks more impressive than it should.
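Pseudo-replication shows up even in a toy example. Below, one invented participant contributes five readings and two others contribute one each; pooling all observations lets the heavily measured participant dominate, while averaging within participants first gives a different answer from the same data.

```python
import pandas as pd

# Hypothetical repeated measures: participant 1 contributes five readings,
# participants 2 and 3 contribute one each.
df = pd.DataFrame({
    "participant": [1, 1, 1, 1, 1, 2, 3],
    "value":       [9, 9, 9, 9, 9, 3, 6],
})

# Wrong unit of analysis: pooling all observations
pooled_mean = df["value"].mean()

# Right unit of analysis: average within participants first, then across them
per_person = df.groupby("participant")["value"].mean()
correct_mean = per_person.mean()

print(pooled_mean, correct_mean)  # about 7.71 vs 6.0: same data, different answer
```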
Data Leakage
In predictive modelling, AI may inadvertently include information from the future or from the outcome variable in the features used for prediction. Kapoor and Narayanan (2023) documented how this kind of data leakage has compromised a significant body of machine-learning research across multiple scientific fields. The models report excellent performance — because they are effectively cheating by using information they should not have access to.
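One common form of leakage is preprocessing before splitting. The numpy sketch below, on randomly generated data, standardises a test set two ways: using statistics computed on all the data (leaky) versus statistics computed on the training set only (clean). The two versions differ, because the leaky version has quietly absorbed information about the test distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=50, scale=10, size=100)
train, test = X[:80], X[80:]

# Leaky: scale using statistics computed on ALL data, test set included
leaky_test = (test - X.mean()) / X.std()

# Clean: fit the scaler on the training set only, then apply it to the test set
clean_test = (test - train.mean()) / train.std()

# The "same" test data now comes in two different versions
print(np.abs(leaky_test - clean_test).max())
```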
💡 The Key Lesson
A correct-looking output is not a correct output. The most dangerous errors in AI-assisted analysis are the ones that produce professional, publication-ready results that happen to be wrong. Your defence against silent errors is not better AI — it is better verification practices. Check what variable the AI used. Check how it handled missing data. Check that the statistical test matches your data structure. Check the units. Check the direction of effects against your expectations. If you cannot explain every step of the analysis, you do not understand the analysis — and you should not publish it.
📖 Key Reading: Kapoor & Narayanan (2023)
The paper "Leakage and the Reproducibility Crisis in Machine-Learning-Based Science" published in Patterns documents how data leakage — where information from outside the training set improperly influences model development — has affected research across 17 scientific fields. The authors found that papers with leakage reported substantially inflated performance metrics. This is not an obscure technical problem: it represents a systematic way in which computational analysis can go wrong, producing results that look strong but do not replicate. The paper provides a taxonomy of leakage types and practical guidelines for avoiding them. It is essential reading for anyone using AI or machine learning in their research.
🧠 Domain Expertise as the Essential Complement
AI finds patterns. That is what it does, and it does it faster than any human. But finding a pattern is not the same as finding a meaningful pattern. The entire history of science is littered with patterns that turned out to be coincidences, artefacts, or confounders. Your domain expertise is what separates genuine discoveries from statistical noise.
Spurious Correlations: When Patterns Lie
Tyler Vigen's Spurious Correlations project provides a vivid illustration of this problem. He documents strong statistical correlations between completely unrelated variables: the divorce rate in Maine and per capita consumption of margarine, for instance, or the number of people who drowned in pools and the number of films Nicolas Cage appeared in. These correlations are real — the numbers genuinely track together — but they are meaningless. No causal mechanism connects them.
AI will find these kinds of correlations in your data. It will present them with confidence intervals and p-values. It will not tell you that they are nonsense — because it cannot distinguish a meaningful relationship from a coincidental one. That distinction requires understanding the subject matter: the biology, the economics, the psychology, the physics. It requires knowing what should be related and what should not.
🔬 Simpson's Paradox: When Aggregation Deceives
Simpson's paradox occurs when a trend that appears in aggregated data reverses or disappears when the data are separated into subgroups. This is not a rare curiosity — it is surprisingly common in real research data, and AI tools will typically show you whichever version of the data you ask for without flagging the paradox.
Classic example: A treatment appears to be less effective than a control overall. But when you break the data down by severity of illness, the treatment is better in every subgroup. The paradox arises because sicker patients were more likely to receive the treatment, creating a confound in the aggregate data.
If you ask AI to "compare outcomes between treatment and control groups," it will give you the aggregate comparison — the misleading one. It will not spontaneously check for Simpson's paradox unless you know to ask. And knowing to ask requires understanding that severity might be a confounding variable — which requires domain expertise.
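The paradox can be reproduced in a few lines. The counts below are invented to match the scenario above: sicker patients were preferentially given the treatment, so the aggregate comparison and the stratified comparison point in opposite directions.

```python
import pandas as pd

# Invented counts constructed to exhibit Simpson's paradox
df = pd.DataFrame({
    "group":     ["treatment", "treatment", "control", "control"],
    "severity":  ["severe", "mild", "severe", "mild"],
    "n":         [200, 50, 50, 200],
    "recovered": [100, 45, 20, 160],
})

# Aggregate comparison: the one AI gives you by default
agg = df.groupby("group").sum(numeric_only=True)
agg_rate = agg["recovered"] / agg["n"]
print(agg_rate)    # control 72% vs treatment 58%: control "wins" overall

# Stratified by severity: the treatment is better in EVERY subgroup
strat = df.set_index(["severity", "group"])
strat_rate = strat["recovered"] / strat["n"]
print(strat_rate)  # severe: 50% vs 40%; mild: 90% vs 80%
```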
Overfitting: Learning Noise
When AI models are given too much freedom to find patterns, they "learn" the noise in your data rather than the underlying signal. An overfitted model will perform brilliantly on your existing data and fail completely on new data. AI will not warn you that this is happening — the training metrics will look excellent. You need to know enough about your data and your methods to insist on proper validation: holdout sets, cross-validation, and out-of-sample testing. The AI will do these things if you ask, but it will not always do them by default.
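Overfitting can be demonstrated without any machine-learning library. In the sketch below, on randomly generated data whose true signal is a straight line, a degree-9 polynomial through 10 training points achieves a near-zero training error yet performs far worse on fresh data from the same process. The data and model choices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = np.linspace(0, 1, 10)
y_train = x_train + rng.normal(scale=0.1, size=10)   # true signal: y = x, plus noise

# Overfit: a degree-9 polynomial through 10 points fits the noise exactly
overfit = np.polyfit(x_train, y_train, deg=9)
# Sensible: a straight line matches the true data-generating process
linear = np.polyfit(x_train, y_train, deg=1)

# Evaluate on NEW data drawn from the same process
x_new = rng.uniform(0, 1, 200)
y_new = x_new + rng.normal(scale=0.1, size=200)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

print(mse(overfit, x_train, y_train))  # essentially zero: "excellent" training metrics
print(mse(overfit, x_new, y_new))      # much larger out of sample
print(mse(linear, x_new, y_new))       # the simpler model typically generalises better
```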
Confounding Variables
A confounding variable is one that influences both the supposed cause and the supposed effect, creating the illusion of a direct relationship. Ice cream sales and drowning deaths both increase in summer — not because ice cream causes drowning, but because hot weather drives both. AI cannot identify confounders without domain knowledge. It will happily model the direct relationship between ice cream and drowning if that is what the data show. Only you know which variables might be confounders in your specific research context.
💡 The Researcher's Irreplaceable Role
AI can process data faster, generate more plots, and test more hypotheses than any human researcher. But it cannot do three things that are fundamental to research:
1. Know what questions to ask. The most important moment in any analysis is deciding what to analyse. AI can answer questions; it cannot identify which questions matter.
2. Know what results are surprising. A surprising result — one that contradicts existing theory or prior findings — is where scientific progress happens. But recognising surprise requires knowing what was expected, and that requires deep familiarity with your field.
3. Know what findings would change the field. Not all statistically significant results are scientifically important. Understanding the implications of a finding — how it fits into or challenges the existing body of knowledge — is an inherently human judgment that no amount of data processing can replace.
⚠️ Statistical Pitfalls AI Can Introduce
Beyond the silent errors in code, AI can introduce more subtle statistical problems into your analysis. These are not bugs — they are features of how AI approaches data analysis that can systematically inflate your confidence in findings that may not be real.
- Confusing Correlation with Causation
AI will present correlations as "findings" without distinguishing them from causal relationships. If you ask "what factors predict student success?", AI will identify variables that correlate with success — but correlation is not causation. A variable that predicts an outcome is not necessarily a variable that influences it. AI will not make this distinction for you. It will present a list of predictors and let you draw the causal conclusions — correctly or not.
- The Multiple Comparisons Problem
If you test 20 hypotheses at the p < 0.05 level, you expect to find one "significant" result by chance alone — even if none of the hypotheses is true. AI makes it trivially easy to test dozens or hundreds of hypotheses in minutes. Without correction for multiple comparisons (Bonferroni, false discovery rate, or similar), the more you test, the more "significant" results you will find by chance. AI will dutifully report each one as significant unless you specifically instruct it to correct for multiple testing.
- Cherry-Picking Results
When AI summarises an analysis, it tends to highlight the most dramatic finding — the largest effect size, the smallest p-value, the most visually striking pattern. This is not necessarily the most important or most reliable finding. If you run a comprehensive analysis and only report the highlights that AI emphasises, you are cherry-picking — even if you did not intend to. The remedy is to report the full analysis, including null results and weak effects, not just the headline numbers.
- Accidental P-Hacking
P-hacking occurs when analysts iterate through different analysis choices until they find a statistically significant result. With AI, this can happen almost unconsciously. You ask for an analysis, the result is non-significant, so you ask AI to "try a different approach" or "check if there's a better way to model this." Each iteration represents a new analysis decision, and if you keep iterating until p < 0.05, you have p-hacked — even though you never deliberately set out to do so. The speed and ease of AI-assisted iteration makes this especially insidious.
- The Garden of Forking Paths
Every data analysis involves dozens of decisions: how to handle outliers, which covariates to include, how to define subgroups, which time period to analyse, what transformation to apply. Each decision creates a "fork" in the analytical path. AI makes these decisions for you, often without flagging that alternatives exist. The problem is that different reasonable choices can lead to different conclusions. If AI happens to choose the combination of decisions that produces a significant result, you may not realise that equally reasonable alternative choices would have produced a null result. Pre-registration of your analysis plan is the strongest defence against this problem.
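The multiple comparisons problem in particular is easy to verify by simulation: run 20 comparisons where the null hypothesis is true every time, and chance alone will typically hand you a "significant" result, which a Bonferroni correction then removes. The sketch below assumes scipy is available; the data are pure noise by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# 20 comparisons where the null hypothesis is TRUE in every case:
# both groups are drawn from the same distribution.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(20)
])

print((p_values < 0.05).sum())       # chance alone typically yields about one "hit"

# Bonferroni correction: divide the threshold by the number of tests
print((p_values < 0.05 / 20).sum())  # usually zero after correction
```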
⚠️ The Speed Trap
All of these pitfalls share a common amplifier: speed. When analysis takes hours or days, researchers naturally consider each step carefully. When AI can produce a complete analysis in seconds, the temptation to iterate rapidly — trying different models, subsets, and specifications until something "works" — becomes almost irresistible. The speed of AI-assisted analysis is a feature, but it is also a risk multiplier for every statistical pitfall on this list. Slow down. Think about each result before requesting the next analysis.
📚 Readings and Resources
Core Readings
- Kapoor, S. & Narayanan, A. (2023). "Leakage and the Reproducibility Crisis in Machine-Learning-Based Science." Patterns, 4(9).
The definitive paper on how data leakage compromises ML-based research across scientific fields. Documents systematic problems with computational analysis that produce impressive-looking but unreliable results. Essential for understanding why AI-generated analysis requires careful verification.
- Cheng, L., Li, X., & Bing, L. (2023). "Is GPT-4 a Good Data Analyst?" arXiv:2305.15038.
An empirical evaluation of GPT-4's capabilities for end-to-end data analysis tasks. The paper systematically tests the model's ability to handle data cleaning, visualisation, and statistical analysis, revealing both genuine strengths and consistent weaknesses. Provides concrete evidence for the claims made in this lesson about what AI can and cannot do with data.
- Vigen, T. "Spurious Correlations."
A vivid and memorable demonstration of why correlation does not imply causation. Browse the examples before class — they are entertaining, but the underlying lesson is serious. AI will find these kinds of correlations in your data and present them as findings.
Supplementary Readings
- Narayanan, A. & Kapoor, S. AI as Normal Technology (formerly AI Snake Oil). Ongoing commentary from the authors of the leakage paper, covering the broader landscape of AI hype and how to distinguish real capabilities from exaggerated claims. Highly recommended for developing critical judgment about AI tools. normaltech.ai
- Mollick, E. One Useful Thing (Substack). Practical, research-informed commentary on AI in education and research from a Wharton professor. Mollick writes extensively about using AI for data analysis with appropriate caution about limitations. oneusefulthing.org
Summary
This lesson has focused on one core reality: AI can accelerate every stage of data analysis, from cleaning to exploration to statistical testing, but it introduces risks at every stage as well. The most dangerous of these risks are not the ones that produce error messages — they are the ones that produce clean, professional, wrong results.
Data cleaning with AI is powerful but requires transformation logs and domain-informed sanity checks. Exploratory analysis with AI generates breadth but not direction — you must supply the scientific judgment about what matters. Silent errors in AI-generated code are pervasive and can only be caught through careful verification. Domain expertise remains the irreplaceable complement to AI's computational power: it is what allows you to distinguish signal from noise, meaningful from spurious, and important from trivial. And the statistical pitfalls that AI can introduce — from multiple comparisons to p-hacking to the garden of forking paths — are amplified by the very speed that makes AI-assisted analysis attractive.
The bottom line is not that you should avoid AI for data analysis. It is that you should use it with your eyes open, your verification practices in place, and your domain expertise fully engaged.
Next session: In Sub-Lesson 3 (Visualization with AI), we will explore how AI can help you create effective data visualisations — and how the same tool that makes beautiful charts can also make misleading ones. We will cover principles of honest visualisation, common AI defaults that distort data, and how to build graphics that communicate your findings accurately.